Data Cleansing and Preparation for Moving Toward Electronic Library Repository

Internal identifier: 001320 (Main/Exploration); previous: 001319; next: 001321

Authors: Asanee Kawtrakul [Thailand]

Source:

Lecture Notes in Computer Science [ISSN 0302-9743]; 2005.

RBID : ISTEX:6513096E6832CD8AF3E6E1C502602E021919943A

French descriptors

Pascal (Inist)
- Bibliothèque électronique, Détection erreur, Entrepôt donnée, Métadonnée, Nettoyage, Pertinence, Reconnaissance caractère, Reconnaissance optique caractère, Texte.

English descriptors

KwdEn:
- Character recognition, Cleaning, Data warehouse, Electronic library, Error detection, Metadata, Optical character recognition, Relevance, Text.

Abstract

Manually annotated metadata usually contain typing errors, but correcting them by hand is costly and time-consuming. This paper proposes a framework that eases metadata correction: a system that uses OCR and NLP techniques to automatically extract metadata from document images. The system first converts images into text using OCR and then extracts metadata from the OCR results. The extracted metadata are then compared with the data in the existing repository to locate erroneous entries. These entries are displayed to users, who correct them using supporting information. Although a human decision is still required to correct each error, this step is needed only for the erroneous entries. Experimental results on 3,712 thesis abstracts show that the proposed solution can automatically extract the relevant information with 91.41% accuracy.
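The comparison step described in the abstract (checking extracted metadata against the existing repository to locate entries that need human review) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in the paper: the record layout, the `normalize` rule, and the names `find_error_entries`, `repository`, and `extracted` are all hypothetical, and the OCR and NLP extraction stages are assumed to have already produced the `extracted` dictionary.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def find_error_entries(repository: dict, extracted: dict) -> list:
    """Return suspect fields where the stored value disagrees with the OCR value.

    Both arguments map a record id to a dict of field name -> string
    (a hypothetical layout chosen for this sketch). Each result is a tuple
    (record_id, field, stored, candidate, score), where score is a
    similarity ratio in [0, 1] meant as supporting information for the
    person who reviews the entry.
    """
    suspects = []
    for record_id, stored_fields in repository.items():
        ocr_fields = extracted.get(record_id, {})
        for field, stored in stored_fields.items():
            candidate = ocr_fields.get(field)
            if candidate is None:
                continue  # nothing was extracted for this field, so skip it
            a, b = normalize(stored), normalize(candidate)
            if a != b:
                score = SequenceMatcher(None, a, b).ratio()
                suspects.append((record_id, field, stored, candidate, score))
    return suspects


# Hypothetical sample data: the stored title of record "t2" contains a typo.
repository = {
    "t1": {"title": "Rice Yield Forecasting", "author": "S. Example"},
    "t2": {"title": "Soil Moisture Anaylsis", "author": "P. Sample"},
}
extracted = {
    "t1": {"title": "rice  yield forecasting", "author": "S. Example"},
    "t2": {"title": "Soil Moisture Analysis", "author": "P. Sample"},
}

for entry in find_error_entries(repository, extracted):
    print(entry)
```

Only the flagged entries would be shown to a reviewer, which matches the abstract's point that manual correction is needed only for erroneous entries. A mismatch may also come from an OCR mistake rather than a typing error in the repository, which is why the sketch reports both values and leaves the decision to the reviewer.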

Url:

https://api.istex.fr/document/6513096E6832CD8AF3E6E1C502602E021919943A/fulltext/pdf

DOI: 10.1007/11599517_69

Affiliations:

Thailand

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Data Cleansing and Preparation for Moving Toward Electronic Library Repository</title>
<author><name sortKey="Kawtrakul, Asanee" sort="Kawtrakul, Asanee" uniqKey="Kawtrakul A" first="Asanee" last="Kawtrakul">Asanee Kawtrakul</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:6513096E6832CD8AF3E6E1C502602E021919943A</idno>
<date when="2005" year="2005">2005</date>
<idno type="doi">10.1007/11599517_69</idno>
<idno type="url">https://api.istex.fr/document/6513096E6832CD8AF3E6E1C502602E021919943A/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001017</idno>
<idno type="wicri:Area/Istex/Curation">000F67</idno>
<idno type="wicri:Area/Istex/Checkpoint">000C29</idno>
<idno type="wicri:doubleKey">0302-9743:2005:Kawtrakul A:data:cleansing:and</idno>
<idno type="wicri:Area/Main/Merge">001356</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:06-0063154</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000413</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000374</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000419</idno>
<idno type="wicri:doubleKey">0302-9743:2005:Kawtrakul A:data:cleansing:and</idno>
<idno type="wicri:Area/Main/Merge">001449</idno>
<idno type="wicri:Area/Main/Curation">001320</idno>
<idno type="wicri:Area/Main/Exploration">001320</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Data Cleansing and Preparation for Moving Toward Electronic Library Repository</title>
<author><name sortKey="Kawtrakul, Asanee" sort="Kawtrakul, Asanee" uniqKey="Kawtrakul A" first="Asanee" last="Kawtrakul">Asanee Kawtrakul</name>
<affiliation wicri:level="1"><country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Department of Computer Engineering, Kasetsart University, 10900, Bangkok</wicri:regionArea>
<wicri:noRegion>Bangkok</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Thaïlande</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2005</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">6513096E6832CD8AF3E6E1C502602E021919943A</idno>
<idno type="DOI">10.1007/11599517_69</idno>
<idno type="ChapterID">69</idno>
<idno type="ChapterID">Chap69</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Cleaning</term>
<term>Data warehouse</term>
<term>Electronic library</term>
<term>Error detection</term>
<term>Metadata</term>
<term>Optical character recognition</term>
<term>Relevance</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Bibliothèque électronique</term>
<term>Détection erreur</term>
<term>Entrepôt donnée</term>
<term>Métadonnée</term>
<term>Nettoyage</term>
<term>Pertinence</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Texte</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Manually annotated metadata usually contains errors from mistyping; however, correcting those metadata manually could be costly and time consuming. This paper proposed a framework to ease metadata correction processed by proposing a system that utilizes OCR and NLP techniques to automatically extract metadata from document image. The system firstly converts images into text using OCR and then extracts metadata from OCR results. After that, the extracted metadata are compared with the data in existing repository to locate error entries. The error entries are then displayed to users whom will correct them using supporting information. Although human decision is required to correct the error manually, this step is necessary with only error entries. The experimental results with 3,712 thesis abstracts show that the proposed solution can automatically extract the relevance information with 91.41% accuracy.</div>
</front>
</TEI>
<affiliations><list><country><li>Thaïlande</li>
</country>
</list>
<tree><country name="Thaïlande"><noRegion><name sortKey="Kawtrakul, Asanee" sort="Kawtrakul, Asanee" uniqKey="Kawtrakul A" first="Asanee" last="Kawtrakul">Asanee Kawtrakul</name>
</noRegion>
<name sortKey="Kawtrakul, Asanee" sort="Kawtrakul, Asanee" uniqKey="Kawtrakul A" first="Asanee" last="Kawtrakul">Asanee Kawtrakul</name>
</country>
</tree>
</affiliations>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001320 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001320 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:6513096E6832CD8AF3E6E1C502602E021919943A
   |texte=   Data Cleansing and Preparation for Moving Toward Electronic Library Repository
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server

Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.